Abstract: An optimal index solving top-k document retrieval [Navarro and
Nekrich, SODA'12] takes O(m + k) time for a pattern of length m, but its
space is at least 80n bytes for a collection of n symbols. We reduce it to 1.5n to
3n bytes, with O(m + (k + log log n) log log n) time, on typical texts. The index is
up to 25 times faster than the best previous compressed solutions, and requires
at most 5% more space in practice (and in some cases as little as one half).
Apart from replacing classical by compressed data structures, our main idea is
to replace suffix tree sampling by frequency thresholding to achieve compression.

1 Introduction 

Finding the k documents most relevant to a query is at the heart of search engines
and information retrieval [23]. A simple relevance measure is the number of
occurrences of the query in the documents (term frequency). Typically the data structure
employed to solve those "top-k" queries is the inverted index. Inverted indexes work
well, but they are limited to scenarios where the queryable terms are predefined and
not too many (typically "words" in Western languages), while they cannot search for
arbitrary patterns (i.e., substrings in the sequences of symbols). This complicates the
use of inverted indexes for Oriental languages such as Chinese, Japanese and Korean,
for agglutinating languages such as Finnish and German, and in other types of
collections containing DNA and protein sequences, source code, MIDI streams, and other
symbolic sequences. Top-k document retrieval is of interest on those more general
sequence collections [17, 11, 19, 21], yet the problem of finding top-k documents
containing the pattern as a substring, even with a simple measure like term frequency,
is much more challenging.

The general problem can be defined as follows: Preprocess a collection of d
documents containing sequences of total length n over an alphabet of size σ, so that later,
given a query string P of length m, one retrieves k documents with highest relevance
to P, for some definition of relevance. Hon et al. [12, 9] presented the first efficient
solution for this problem, achieving O(m + log n log log n + k) time, yet with
superlinear space usage, O(n log^2 n) bits. Then Hon et al. [11] improved the solution to
O(m + k log k) time and linear space, O(n log n) bits. Recently, Navarro and Nekrich
[19] achieved optimal O(m + k) time, using O(n(log σ + log d)) bits. Although the
latter solution essentially closes the problem in theoretical terms, the constants involved
are not small, especially in space: Their index can use up to 80n bytes, making it
infeasible for real scenarios.
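To make the problem statement concrete, the following is a minimal brute-force baseline for term-frequency relevance. It is only a reference for what a top-k index must return, not any of the indexed solutions cited above; the function names and the tie-breaking rule (smaller document identifier first) are assumptions made for illustration.

```python
def count_occurrences(text, pattern):
    """Count possibly overlapping occurrences of pattern in text."""
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def top_k_brute_force(documents, pattern, k):
    """Return up to k (doc_id, frequency) pairs with the highest term frequency.

    Documents where the pattern does not occur are never reported.
    Ties are broken by smaller doc_id (an arbitrary choice for this sketch).
    """
    hits = []
    for doc_id, doc in enumerate(documents):
        freq = count_occurrences(doc, pattern)
        if freq > 0:
            hits.append((doc_id, freq))
    hits.sort(key=lambda pair: (-pair[1], pair[0]))
    return hits[:k]


docs = ["abracadabra", "banana", "cabana", "xyz"]
best = top_k_brute_force(docs, "a", 2)
```

This scan costs time proportional to the whole collection per query, which is exactly what the indexes discussed in this paper avoid.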

*Funded by Fondecyt grant 1-110066 and by the Conicyt PhD Scholarship Program, Chile.

There has been some work aiming to reduce the space of top-k indexes [2H 151 l2"Tj,
yet they come at the cost of search times of at least O(m + k log k log^(1+ε) n) for any
constant ε > 0, while reaching as low as n log σ + O(n log log log d) bits of space (all
our logarithms are in base 2). In practice, the best ones [21] require 2n to 4n bytes and
answer top-10 queries in about a millisecond. Their main idea is suffix tree sampling,
that is, storing the top-k answers for large enough suffix tree nodes.

Hon et al. [10] have proposed an intermediate alternative, which is basically an
engineered implementation of their classical scheme. They use (log σ + 2 log d)(n +
o(n)) bits and O(m + k log log n + (log log n)^4) time, or (log σ + log d)(n + o(n)) bits
and O(m + k (log σ log log n)^(1+ε) + (log log n)^6) time. This solution has not yet been
implemented, however. We estimate their space would be at least 4n to Qn in practice.

In this work we design and implement a fast and compact solution for top-k
document retrieval, building on the ideas of Navarro and Nekrich [19]. Apart from
replacing classical by compact data structures, we use a novel idea of frequency
thresholding instead of sampling suffix tree nodes: We store all the solutions for all the
suffix tree nodes, but discard those with frequency 1.

We obtain time O(m + (k + log log n) log log n) and space (log σ + log d + 4 log log n)(n +
o(n)) bits for typical texts. By "typical" we mean that our results hold almost surely
(a.s.; a very strong kind of convergence, see footnote 1) for texts sampled from a
stationary mixing ergodic source (more precisely, type A2 in Szpankowski's sense [26]).
This is also a quite general assumption including Bernoulli and Markovian models.

In addition, we have implemented our index, showing its practicality. It turns
out to require about 1.5n to 3n bytes, that is, 25 to 50 times less than a naive
implementation of the basic idea [19] and at most 5% more space than the most compressed
practical solutions [21] (while in some cases our index uses half the space). Its time
per query is k to 4k microseconds, outperforming the more compressed solutions by up
to 25 times. This is the first top-k index for general texts that achieves both little
space and query times in the microsecond range. Moreover, it shows that our idea of
thresholding frequencies generally gives better results than the previous trend of
sampling suffix tree nodes.

2 Basic Concepts 

Consider a collection of string documents D_i concatenated into T[1, n] = D_1 D_2 ...
D_d, T = t_1 t_2 ... t_n, where at the end of each D_i a special symbol $ is used to mark
the end of that document. A suffix array p3] SA[1, n] contains pointers to every
suffix of T, lexicographically sorted. For a position i ∈ [1, n], SA[i] points to the
suffix T[SA[i], n] = t_{SA[i]} t_{SA[i]+1} ... t_n, where it holds T[SA[i], n] < T[SA[i + 1], n].
The occ = ep - sp + 1 occurrences of a pattern P[1, m] in T are pointed from a range
SA[sp, ep], which can be found and listed in time O(m log n + occ).
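The range SA[sp, ep] can be illustrated with a small sketch. It assumes a plain (uncompressed) suffix array built by naive sorting and uses 0-based, inclusive indexes instead of the 1-based ones in the text; the function names are hypothetical.

```python
def build_suffix_array(text):
    """Naive construction: sort the starting positions by their suffixes."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def pattern_range(text, sa, pattern):
    """Return (sp, ep), 0-based and inclusive, or None if pattern is absent.

    Two binary searches compare the pattern with the length-m prefix of
    each suffix, which gives the O(m log n) search cost.
    """
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose m-prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    sp = lo
    hi = len(sa)
    while lo < hi:  # first suffix whose m-prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    ep = lo - 1
    return (sp, ep) if sp <= ep else None


text = "banana$"
sa = build_suffix_array(text)
rng = pattern_range(text, sa, "ana")
```

When a range is returned, the number of occurrences is ep - sp + 1 and the occurrence positions are the entries sa[sp], ..., sa[ep].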

The suffix tree [27] of T is a path-compressed trie (i.e., unary paths are collapsed)
in which all the suffixes of T are inserted. Internal nodes correspond to repeated
strings of T and the leaves correspond to suffixes. For internal nodes v, path(v) is
the concatenation of the edge labels from the root to v. The suffix tree finds the
occurrences of P in T in time O(m + occ), by traversing it from the root to the
locus of P, i.e., the highest node v such that P is a prefix of path(v). Then all the
occurrences of P correspond to the leaves of the subtree rooted at v. These leaves
correspond to the range SA[sp, ep]; indeed, v is the lowest common ancestor of the
sp-th and the ep-th leaves. The suffix tree has O(n) nodes.

1 A sequence X_n tends to a value β almost surely if, for every ε > 0, the probability that
|X_N/β - 1| > ε for some N > n tends to zero as n tends to infinity,
lim_{n→∞} sup_{N>n} Pr(|X_N/β - 1| > ε) = 0.

Compressed suffix arrays (CSAs) [18] can represent the text and its suffix array
within essentially nH_k(T) ≤ n log σ bits. Here H_k(T) is the empirical kth order
entropy of T [15], a lower bound to the bits per symbol emitted by a statistical
compressor of order k. This representation allows us to count (determine the interval
[sp, ep] corresponding to a pattern P), access (compute SA[i] for any i), and extract
(rebuild any T[l, r]). We use one [2] that can count in time O(m), access in time O(s)
and extract in time O(s + r - l) while using nH_k(T) + o(nH_k(T)) + O(n + (n/s) log n)
bits for any k ≤ α log_σ n and any constant 0 < α < 1. The O((n/s) log n) bits
correspond to storing one SA[i] value every s text positions.

General trees of n nodes can be represented using 2n + o(n) bits. In this paper we
use a representation [25] that supports in O(1) time a number of operations, including
preorder(v) (the preorder of node v), preorderselect(i) (the ith node in preorder),
depth(v) (depth of node v), subtreesize(v) (number of nodes in subtree rooted at v),
lca(u, v) (lowest common ancestor of nodes u and v), and many others. This structure
is practical and implemented pQ, using 2.37n bits.

Bitmaps B[1, n] can be represented using n + o(n) bits, so that we can solve
in constant time operations rank_b(B, i) (number of occurrences of bit b in B[1, i])
and select_b(B, j) (position in B of the jth occurrence of bit b) [IE]. We use an
implementation [7] that requires 1.05n bits, yet for very sparse bitmaps (with m ≪ n
bits set) we prefer a compressed one using m log(n/m) + 2m bits [22].
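The semantics of rank and select can be pinned down with a small sketch. It stores plain Python lists of prefix counts and positions, so it does not achieve the n + o(n) bits of the cited structures; the class name and method signatures are assumptions for illustration, and positions are 1-based as in the text.

```python
class Bitmap:
    """Rank/select over a bit sequence, 1-based positions (non-succinct sketch)."""

    def __init__(self, bits):
        self.bits = list(bits)
        # prefix_ones[i] = number of 1s in B[1, i]
        self.prefix_ones = [0]
        for bit in self.bits:
            self.prefix_ones.append(self.prefix_ones[-1] + bit)
        # positions[b] = sorted 1-based positions holding bit b
        self.positions = {0: [], 1: []}
        for pos, bit in enumerate(self.bits, start=1):
            self.positions[bit].append(pos)

    def rank(self, b, i):
        """Number of occurrences of bit b in B[1, i]; rank(b, 0) is 0."""
        ones = self.prefix_ones[i]
        return ones if b == 1 else i - ones

    def select(self, b, j):
        """Position of the j-th occurrence of bit b, or None if there is none."""
        occurrences = self.positions[b]
        if 1 <= j <= len(occurrences):
            return occurrences[j - 1]
        return None


bm = Bitmap([1, 0, 0, 1, 1, 0, 1])
```

The two operations are inverses in the sense that rank(b, select(b, j)) = j whenever the j-th occurrence exists.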

Range Maximum Queries (RMQs) ask for the position of the maximum element
in a range of an array, RMQ_A(i, j) = argmax_{i ≤ k ≤ j} A[k]. They can be solved in constant
time after preprocessing A and storing a structure using 2n + o(n) bits. No accesses
to A are needed at query time [6]. The solution requires lca queries on a tree called
a "2d-min-heap", and we implement it over our compact trees [25].
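As an illustration of the RMQ interface only, the sketch below uses a classical sparse table. This is a swapped-in technique, not the 2n + o(n)-bit structure of [6]: it takes O(n log n) words and does access A at query time. The class name is hypothetical, indexes are 0-based and inclusive, and ties are resolved towards the leftmost maximum.

```python
class SparseTableRMQ:
    """Constant-time range-maximum position queries via a sparse table."""

    def __init__(self, values):
        self.values = list(values)
        n = len(self.values)
        # table[level][i] = position of the maximum in values[i : i + 2**level]
        self.table = [list(range(n))]
        length = 1
        while 2 * length <= n:
            prev = self.table[-1]
            row = [self._better(prev[i], prev[i + length])
                   for i in range(n - 2 * length + 1)]
            self.table.append(row)
            length *= 2

    def _better(self, i, j):
        # Callers pass i <= j, so ">=" keeps the leftmost maximum on ties.
        return i if self.values[i] >= self.values[j] else j

    def query(self, i, j):
        """Position of the maximum of values[i..j], for 0 <= i <= j < n."""
        level = (j - i + 1).bit_length() - 1
        left = self.table[level][i]
        right = self.table[level][j - (1 << level) + 1]
        return self._better(left, right)


rmq = SparseTableRMQ([3, 1, 4, 1, 5, 9, 2, 6])
```

A range minimum structure, as needed for Muthukrishnan's array C, is obtained by negating the values.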

Direct Access Codes [1] represent a sequence of variable-length numbers by packing
them into chunks of length b. Then the chunks are rearranged to allow one to access
any ℓ-bit number in the sequence in time O(ℓ/b). The space overhead for a number of
ℓ bits is ℓ/b + b. We use their implementation, which chooses the b values optimally.
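The chunk rearrangement can be sketched as follows, assuming a single fixed chunk length b for all levels (the cited implementation optimizes the b values, which this sketch does not attempt). Level j keeps the j-th chunk of every number that is long enough, plus one flag per chunk telling whether the number continues in the next level. The function names are hypothetical, and a linear scan stands in for the constant-time rank on the flags.

```python
def dac_encode(numbers, b):
    """Split nonnegative integers into b-bit chunks, least significant first.

    Returns a list of levels; each level is (chunks, flags), where flags[i]
    is 1 if the number owning chunks[i] has more chunks in the next level.
    """
    mask = (1 << b) - 1
    levels = []
    current = list(numbers)
    while current:
        chunks = [x & mask for x in current]
        flags = [1 if (x >> b) > 0 else 0 for x in current]
        levels.append((chunks, flags))
        current = [x >> b for x in current if (x >> b) > 0]
    return levels


def dac_access(levels, b, i):
    """Recover the i-th number (0-based), visiting one level per chunk."""
    value, shift = 0, 0
    for chunks, flags in levels:
        value |= chunks[i] << shift
        if not flags[i]:
            break
        # rank_1(flags, i): index of this number within the next level.
        i = sum(flags[:i])
        shift += b
    return value


numbers = [5, 300, 0, 70000, 17, 255, 256]
levels = dac_encode(numbers, 4)
```

A number of ℓ bits is read in about ℓ/b steps, and small numbers (the common case for term frequencies) stop at the first level.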

Wavelet trees [8] can be used to represent an n × r grid that contains n points,
one per column [13]. The root represents the sequence of coordinates y_i of the points
in x-coordinate order. It only stores a bitmap B[1, n] telling at B[i] whether y_i < r/2
or not. Then the points with y_i < r/2 are represented, recursively, on the left child of
the root, and the others on the right. Adding rank capabilities to the bitmaps, the
wavelet tree requires overall n log r (1 + o(1)) bits and can track any point towards its
leaf (where the y_i value is revealed) in time O(log r). It can also count, in O(log r)
time, the number of points lying inside a rectangle [x_1, x_2] × [y_1, y_2]: Start at the root
with the interval [x_1, x_2] and project those values towards the left and right child (on
the left child the interval is [rank_0(B, x_1 - 1) + 1, rank_0(B, x_2)], and similarly with
rank_1 on the right). This is continued until reaching the O(log r) wavelet tree nodes
that cover [y_1, y_2]. Then the answer is the sum of the lengths of the mapped intervals.
One can also track those points toward the leaves and report them, each in
time O(log r). We use a simple balanced wavelet tree without pointers [5].
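The counting procedure can be sketched with a pointer-based wavelet tree, which is simpler to read than the pointerless layout used in the paper. Each node stores prefix counts that play the role of rank_0 on its bitmap. The class name is hypothetical, and columns and y values are 0-based with inclusive query bounds.

```python
class WaveletTree:
    """Wavelet tree over ys, where ys[x] in [lo, hi) is the point in column x."""

    def __init__(self, ys, lo, hi):
        self.lo, self.hi = lo, hi
        self.left = self.right = None
        if hi - lo <= 1 or not ys:
            return  # leaf, or empty node: nothing to split
        mid = (lo + hi) // 2
        # zeros_before[i] = number of values < mid among ys[:i] (a rank_0 table)
        self.zeros_before = [0]
        for y in ys:
            self.zeros_before.append(self.zeros_before[-1] + (1 if y < mid else 0))
        self.left = WaveletTree([y for y in ys if y < mid], lo, mid)
        self.right = WaveletTree([y for y in ys if y >= mid], mid, hi)

    def count(self, x1, x2, y1, y2):
        """Number of points with column in [x1, x2] and y in [y1, y2]."""
        if x1 > x2 or y2 < self.lo or y1 >= self.hi:
            return 0
        if y1 <= self.lo and self.hi - 1 <= y2:
            return x2 - x1 + 1  # node fully covered: whole mapped interval counts
        z1 = self.zeros_before[x1]
        z2 = self.zeros_before[x2 + 1]
        # Project [x1, x2] onto the children using rank_0 and rank_1.
        return (self.left.count(z1, z2 - 1, y1, y2)
                + self.right.count(x1 - z1, x2 - z2, y1, y2))


ys = [3, 0, 5, 1, 7, 2, 6, 4]
wt = WaveletTree(ys, 0, 8)
inside = wt.count(2, 5, 1, 5)
```

The recursion stops at nodes whose y range is fully inside [y_1, y_2], which are the cover nodes mentioned in the text.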

Muthukrishnan's algorithm [17] for listing the distinct elements in a given interval
A[i, j] of an array A[1, n] uses another array C[1, n] where C[i] = max({j < i : A[j] =
A[i]} ∪ {-1}), which is preprocessed for range minimum queries. Each value C[m] < i
for i ≤ m ≤ j is a distinct value A[m] in A[i, j]. A range minimum query in C[i, j]
gives one such value m, and then we continue recursively on A[i, m - 1] and A[m + 1, j]
until the minimum is ≥ i. One retrieves any k unique elements in time O(k).
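The recursion can be sketched as below, with 0-based inclusive indexes. A linear scan stands in for the constant-time range minimum query, so this sketch shows the logic but not the O(k) time bound; the function names are hypothetical.

```python
def build_c_array(a):
    """C[i] = largest j < i with a[j] == a[i], or -1 if there is none."""
    last_seen = {}
    c = []
    for i, value in enumerate(a):
        c.append(last_seen.get(value, -1))
        last_seen[value] = i
    return c


def list_distinct(a, c, i, j):
    """Distinct values occurring in a[i..j], each reported exactly once."""
    result = []
    stack = [(i, j)]
    while stack:
        lo, hi = stack.pop()
        if lo > hi:
            continue
        # Range minimum on C; a real index would answer this in O(1).
        m = min(range(lo, hi + 1), key=lambda p: c[p])
        if c[m] >= i:
            continue  # every value in a[lo..hi] already occurs earlier in a[i..j]
        result.append(a[m])
        stack.append((lo, m - 1))
        stack.append((m + 1, hi))
    return result


a = [2, 1, 2, 3, 1, 1, 4, 3]
c = build_c_array(a)
found = list_distinct(a, c, 2, 5)
```

A position m is reported only when C[m] < i, that is, when it holds the first occurrence of its value inside the queried interval.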

3 The Optimal-Time Linear-Space Solution 

Our implementation is based on the framework proposed by Hon, Shah and Vitter
[11] and then followed by Navarro and Nekrich [19]: Let 𝒯 be the suffix tree for the
concatenation T of a collection of documents D_1, ..., D_d. This tree contains the nodes
corresponding to all the suffix trees 𝒯_i of the documents D_i. For each node u ∈ 𝒯_i,
there is a node v ∈ 𝒯 such that path(v) = path(u). We will say that v = map(u, i).
Also, let parent(u) be the parent of a node u and depth(u) be its depth.

They store 𝒯 plus additional information on the trees 𝒯_i. If v = map(u, i), then
they store i in a list called F-list associated to v. Further, for each v = map(u, i) they
store a pointer ptr(v, i) = map(parent(u), i), noting where the parent of u maps in
𝒯. We add a dummy root ρ to 𝒯 so that ptr(v, i) = ρ if u is the root of 𝒯_i.

Together with the pointers ptr(v, i) they also store a weight w(v, i), which is the
relevance of path(u) in D_i. This relevance can be any function that depends on the
set of starting positions of path(u) in D_i. In this paper we focus on a simple one: the
number of leaves of u in 𝒯_i, that is, the term frequency.

Let v be the locus of P. Hon et al. [11] prove that, for each distinct document D_i
where P appears, there is exactly one pointer ptr(v'', i) = v' going from a descendant
v'' of v (v itself included) to a (strict) ancestor v' of v, and w(v'', i) is the relevance
of P in D_i. Therefore, they find the k largest w values in this set.

Navarro and Nekrich [19] represent this structure as a grid of size O(n) × O(n)
with labeled weighted points, as follows. They traverse 𝒯 in preorder. For each node
v ∈ 𝒯, and for each pointer ptr(v, i) = v', they add a new rightmost x-coordinate with
only one point, with y-coordinate equal to depth(v'), weight equal to w(v, i), and label
equal to i. At query time, they find the locus v of P, determine the range [x_1, x_2] of
all the x-coordinates filled by v or its descendants, find the k highest-weighted points
in [x_1, x_2] × [0, depth(v) - 1], and report their labels. A linear-space representation
(yet with a large constant) allows them to carry out this task in time O(m + k).

4 Our Compressed Representation 

We describe the compressed data structures we use and how we carry out the search.

Suffix tree. We use a CSA [2] requiring nH_k(T) + o(nH_k(T)) + O(n) bits, which
computes [sp, ep] corresponding to P in time O(m). It also computes any SA[i] in
time O(log log n). For this sake we use a sampling every log log n positions. In the
samples we store not the exact position in T but just the document where it lies.
Hence we need O(n log d / log log n) = o(n log d) bits for the sampling.

In practice, we use an off-the-shelf CSA (SSA from the PizzaChili site,
http://pizzachili.dcc.uchile.cl), and add a sparse bitmap D[1, n] marking where
documents start in T: the document corresponding to SA[i] is rank_1(D, SA[i]),
computed in time O(log log n). While this is worse than having the CSA directly return
documents, it lets our CSA retain its other pattern matching functionalities.

We also add 2n + o(n) bits to describe the topology of the suffix tree, using a tree
representation that carries out most of the operations in constant time [25]. Note this
is just the topology, not a full suffix tree, so we need to search using the CSA.

We also add 2n + o(n) bits for an RMQ structure on top of Muthukrishnan's array
C [17], which can list k distinct documents in any interval SA[sp, ep] in time O(k).

Mapping to the grid. The grid is of width Σ_i |𝒯_i| ≤ 2n, as we add one coordinate
per node in the suffix tree of each document. To save space, we will consider a virtual
grid just as defined, but will store a narrower physical grid. In the physical grid, the
entries corresponding to leaves of 𝒯 (which contain exactly one pointer ptr(v, i)) will
not be represented. Thus the physical grid is of width at most n. This frequency
thresholding is a key idea, as it halves the space of most structures in our index.

Two bitmaps will be used to map between the suffix array, the suffix tree, and
the virtual and physical grids: B[1, 2n] and L[1, 2n]. Bitmap B will mark starting
positions of nodes of 𝒯 in the physical grid: each time we arrive at an internal node
v we add a 1 to B, and each time we add a new x-coordinate to the grid (due to a
pointer ptr(v, i)) we add a 0 to B. Bitmap L will mark leaves in the preorder traversal
of 𝒯, using a 1 for leaves and a 0 for internal nodes.

Representing the grid. In the grid there is exactly one point per x-coordinate.
We represent with a wavelet tree [8] the sequence of corresponding y-coordinates.
Note that the height of this grid is c log n for some constant c a.s. [26, Thm. 1(ii)
and Remark 2(iv)]. Thus, the height of the wavelet tree is log log n + O(1) and the
wavelet tree requires n log log n (1 + o(1)) bits in total, a.s. (from now on we will omit,
except in the theorems, that our results hold almost surely and not in the worst case).

Each node v of the wavelet tree represents a subsequence of the original sequence
of y-coordinates. We consider the (virtual) sequence of the weights associated to the
points represented by v, W(v), and build an RMQ data structure [6] for W(v). This
structure requires 2|W(v)| + O(|W(v)|/log n) bits. This adds up to 2n log log n (1 + o(1))
bits for the whole wavelet tree.

Representing labels and weights. The labels of the points, that is, the document
identifiers, are represented directly as a sequence of at most n⌈log d⌉ = n log d + O(n)
bits, aligned to the bottom of the wavelet tree. Given any point to report, we descend
to the leaf in O(log log n) time and retrieve the document identifier.

The weights are stored similarly, but using direct access codes [1] to take advantage
of the fact that most weights (term frequencies) are small. Note that the subtree size
of each internal node of 𝒯_i will be stored exactly once as the weight of some ptr(v, i).

We now analyze the number of bits required to store those numbers. Let
n_i = |D_i|. Since the height of any 𝒯_i is O(log n_i), so is the depth of any node. The
sum of the depths of all the nodes is then O(n_i log n_i), and this is also the sum of all
the subtree sizes. Distributing those sizes over the n_i nodes uniformly (which gives a
pretty pessimistic worst case for the sum of the logarithms) gives O(log n_i) for each.
Thus the number of bits required to represent each of the sizes is at most
log log n_i + O(1) ≤ log log n + O(1). Using direct access codes with block size
b = √(log log n) poses an extra overhead of O(√(log log n)) = o(log log n) bits. Hence all
the weights can be stored in n log log n (1 + o(1)) bits and accessed in time
O(√(log log n)) (see footnote 2).

Answering queries. The first step to answer a query is to use the CSA to
determine the range [sp, ep] in time O(m). To find the locus v of P in the topology of
the suffix tree, we compute l and r, the sp-th and ep-th leaves of the tree, respectively,
using l = preorderselect(select_1(L, sp)) and r = preorderselect(select_1(L, ep)), and
then we have v = lca(l, r). All those operations take O(1) time.

To determine the horizontal extent [x_1, x_2] of the grid that corresponds to the
locus node v, we first compute p_1 = preorder(v) and p_2 = p_1 + subtreesize(v). This
gives the preorder range [p_1, p_2) including leaves. Now l_1 = rank_1(L, p_1) and l_2 =
rank_1(L, p_2 - 1) give the number of leaves up to those preorders. Then, since we have
omitted the leaves in the physical grid, we have x_1 = select_1(B, p_1 - l_1) - (p_1 - l_1) + 1
and x_2 = select_1(B, p_2 - l_2) - (p_2 - l_2). The limits in the y axis are just [0, depth(v) - 1].
Thus the grid area to query is determined in constant time.

Once the range [x_1, x_2] × [y_1, y_2] to query is determined, we proceed to the grid.
We determine the wavelet tree nodes that cover the interval [y_1, y_2], and map the
interval [x_1, x_2] to all of them. As there are at most two such nodes per level, there
are O(log log n) nodes covering the interval, and they are found in O(log log n) time.

We now use a top-k algorithm for wavelet trees [20]. Let v_1, v_2, ..., v_s be the wavelet
tree nodes that cover [y_1, y_2] and let [x_1^i, x_2^i] be the interval [x_1, x_2] mapped to v_i. For
each of them we compute RMQ_{W(v_i)}(x_1^i, x_2^i) to find the position x_i with the largest
weight among the points in v_i, and find out that weight and the corresponding
document, w_i and d_i. We set up a max-priority queue that will hold at most k elements
(elements smaller than the kth are discarded by the queue). We initially insert the
O(log log n) tuples (v_i, x_1^i, x_2^i, x_i, w_i, d_i), with w_i being the sort key. Now we
iteratively extract the tuple with the largest weight, say (v_j, x_1^j, x_2^j, x_j, w_j, d_j). We
report the document d_j with weight w_j, and create two new ranges in v_j: [x_1^j, x_j - 1]
and [x_j + 1, x_2^j]. We compute their RMQ, find the corresponding documents and weights,
and reinsert them in the queue. After k steps, we have reported the top-k documents.
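The extract-and-split loop can be sketched on a single weight sequence, that is, as if only one wavelet tree node covered the query. The sketch departs from the text in three labeled ways: a binary heap from Python's heapq replaces the y-fast trie, the queue is not capped at k elements, and a linear scan stands in for the constant-time RMQ. The function name is hypothetical and indexes are 0-based and inclusive.

```python
import heapq


def top_k_in_range(weights, x1, x2, k):
    """Return up to k (position, weight) pairs of weights[x1..x2], heaviest first."""

    def argmax(lo, hi):
        # Stand-in for a constant-time range maximum query.
        return max(range(lo, hi + 1), key=lambda p: weights[p])

    results = []
    heap = []  # entries: (-weight, position, lo, hi); heapq is a min-heap
    if x1 <= x2:
        p = argmax(x1, x2)
        heap.append((-weights[p], p, x1, x2))
    while heap and len(results) < k:
        neg_weight, p, lo, hi = heapq.heappop(heap)
        results.append((p, -neg_weight))
        # Split the range around the reported position and requeue both halves.
        for sub_lo, sub_hi in ((lo, p - 1), (p + 1, hi)):
            if sub_lo <= sub_hi:
                q = argmax(sub_lo, sub_hi)
                heapq.heappush(heap, (-weights[q], q, sub_lo, sub_hi))
    return results


weights = [4, 1, 7, 3, 9, 2, 8]
top3 = top_k_in_range(weights, 0, 6, 3)
```

With several cover nodes, the only change is that the queue starts with one tuple per node instead of one; each extraction still triggers at most two RMQs and two insertions.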

Using a y-fast trie [28] for the priority queue, the total time is O(log log n) to
find the cover nodes, O((log log n)^2) to determine their tuples and insert them in the
queue, and O(k log log n) to extract the maxima, compute and reinsert new tuples.

Recall that we have not stored the leaves in the grid. Therefore, if the
procedure above yields fewer than k results, we must complete it with documents
where the pattern appears only once. We use Muthukrishnan's algorithm [17] with
the RMQ structure on the C array. We extract distinct documents until we obtain
k distinct documents in total, counting those already reported with the grid. This
requires at most 2k steps, as we can revisit the documents reported with the grid.
Each step requires O(log log n) time to compute the document identifier.

2 We conjecture that the number of bits is actually O(n), which we can prove only for
uniformly distributed texts.

Theorem 1 Given d documents concatenated into a text T[1, n], we can build an
index requiring almost surely (H_k(T) + log d + 4 log log n)(n + o(n)) bits, which can
report the top-k documents most relevant to a search pattern P[1, m] in time O(m +
(k + log log n) log log n) almost surely. Our structure can be built in time O(n log σ +
n log log n) (details omitted).
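As a bookkeeping check, the space bound of Theorem 1 can be recovered by adding the terms stated in Section 4. The grouping below is only a sketch: it assumes that the bitmaps B and L contribute O(n) bits, and it absorbs every O(n) term into the lower-order part, which is valid because O(n) = o(n log log n).

```latex
\begin{align*}
&\underbrace{nH_k(T) + o(nH_k(T)) + O(n)}_{\text{CSA}}
 + \underbrace{o(n\log d)}_{\text{document sampling}}
 + \underbrace{2n + o(n)}_{\text{tree topology}}
 + \underbrace{2n + o(n)}_{\text{RMQ on } C}
 + \underbrace{O(n)}_{\text{bitmaps } B,\, L} \\
&\quad + \underbrace{n\log\log n\,(1+o(1))}_{\text{wavelet tree}}
 + \underbrace{2n\log\log n\,(1+o(1))}_{\text{RMQs on } W(v)}
 + \underbrace{n\log d + O(n)}_{\text{labels}}
 + \underbrace{n\log\log n\,(1+o(1))}_{\text{weights}} \\
&\quad = \bigl(H_k(T) + \log d + 4\log\log n\bigr)\,(n + o(n)).
\end{align*}
```

The factor 4 in front of log log n thus splits as 1 for the wavelet tree bitmaps, 2 for the RMQ structures on the weights, and 1 for the weights themselves.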



5 Experiments and Results 

We compared our solution to the implementation of Navarro and Valenzuela [21],
which is the current state of the art. We use various compact data structure
implementations from libcds (http://libcds.recoded.cl). We used the following
collections in our experiments. Their grid heights are between 5 and 9.

DNA. A sequence of 10,000 highly repetitive (0.05% difference between
documents) synthetic DNA sequences with 100,030,004 bases in total.
KGS. A collection of 18,383 sgf-formatted Go game records from year 2009
(http://www.u-go.net/gamerecords), containing 26,351,161 chars.
Proteins. A collection of 143,244 sequences of Human and Mouse Proteins
(http://www.ebi.ac.uk/swissprot), containing 59,103,058 symbols.
FT91-94. A sample of 40,000 documents from TREC Corpus FT91 to 94
(http://trec.nist.gov), containing 93,498,090 characters.
Wikipedia. A sample of 40,000 documents from the English Wikipedia
containing 83,647,329 characters.

The experiments were performed on an Intel(R) Xeon(R) model E5620 running at
2.40 GHz with 96 GB of RAM and 12,288 KB cache. The operating system is Linux
with kernel 2.6.31-41 64 bits and we used the GNU C compiler version 4.4.3 with the -O3
optimization parameter. For queries, we selected 4,000 random substrings of length
3 and 8, and obtained the top-k documents for each, for k = 10 to 100 in steps of 10.

Figure 1 shows time performance as a function of k. The time taken by the
CSA search is always near 20 microseconds, after which the index takes about k
microseconds. In some cases (KGS, or Wikipedia for m = 8) there are not enough
results with frequency larger than 1, and document listing must be activated, which
slows down the process to 1.&k-4k microseconds. Note that in practice one may wish
to avoid listing those low-frequency documents anyway.

Figure 2 (left) shows the fraction of space used by the different data structures
employed: the CSA, the augmented wavelet tree (WT), the DAC-encoded frequencies
(F), the suffix tree topology (T), the document identifiers (DOC), the mapping
bitmaps B and L (M), the RMQ structure for Muthukrishnan's document listing (C),
and the sparse bitmap D marking document limits. Figure 2 (right) shows the size
ratio over the original dataset (considering one byte per symbol). The values vary
between 1.5 and 3 times the size of the collection. Note that, within this space, we
can reproduce any document of the collection, as our CSA offers access to them.

Figure 1: Time performance as a function of k, for m = 3 (left) and m = 8 (right).

Figure 2: Space consumption of the different compact data structures employed (left)
and the size ratio over the original dataset for the different collections (right).

6 Final Remarks 

Table 1 compares our solution with previous work [21] on the three collections shared,
taking their best compressed (from their variant WT-Alpha+SSGST plus more recent
improvements) and uncompressed (from their variant WT-Plain+SSGST) results (see
footnote 3). Our structure is at most only 5% larger. When both use about the same
space, our structure is 4 to 25 times faster. In other cases our structure can use as
little as half the space, and it is still faster, up to 3 times (for large k and m we must
resort to much document listing, where their wavelet tree on documents is faster).

Needless to say, this is a remarkable result for a structure that, in theory [19], used
about 80 times the collection size. We have sharply compressed it while retaining the
best ideas that led to its optimal time. We believe this establishes a new direction
in which research on space-efficient top-k retrieval could be focused: Rather than
sampling the suffix tree nodes [11, 21], threshold the document frequencies we store
(curiously, this is closer in spirit to the first, superlinear-size, proposed top-k solution
[12, 9]). For example, can we discard all the frequencies below a threshold f and
efficiently list them if needed? Our work shows this is possible at least for f = 1.

3 In their paper [21], the CSA space is not included. We have added ours for a fair comparison.

Table 1: Comparison to the best previous work [21], giving the fraction of their space
we use, and the speedup we obtain with respect to them.

Our approach easily extends to relevance functions other than term frequency.
In most cases it is sufficient to store the appropriate weights in our data structure.
Even if these are not compressible, the space should not grow too much. Our
structure also trivially solves other document listing problems, like k-mining (list the
documents where P appears at least k times). Muthukrishnan [17] solves it in optimal
time O(m + occ) and O(n log n) bits for k fixed at indexing time. For variable k the
space is O(n log^2 n) bits. Our compressed structure, without modifications, solves both
variants in time O(m + (occ + log log n) log log n).